Abstract
This notebook is a template for evaluating COVID-19 case forecast submissions from COVIDhub. After a set of parameters is supplied (forecasters, COVID signals, etc.), the template yields a comprehensive report on the predictions of COVID forecasters and on their performance against the ground truth. The visualizations generated by the template offer an intuitive way to compare the accuracy of forecasters across all US states.
Every week, forecasters submit their predictions to the COVID-19 ForecastHub. This report relies on an AWS bucket that contains forecasts of a handful of signals (e.g., COVID deaths, cases, and hospitalizations) together with an array of evaluation metrics for those forecasts (e.g., absolute error, weighted interval score, and 80% coverage). Alternatively, the data can be retrieved from the publicly accessible covidcast and covideval APIs.
aheads <- as.numeric(params$weeks)
url_case <- "https://forecast-eval.s3.us-east-2.amazonaws.com/score_cards_state_cases.rds"
download.file(url_case, paste0(here(), "/R/eval_cases.RDS"), mode = "wb") # download to disk; "wb" keeps the binary RDS intact on Windows
scores1 <- readRDS(paste0(here(), "/R/eval_cases.RDS")) # read back the file just downloaded
scores <- subset(scores1, forecaster %in% params$forecasters)
our_pred_dates <-
scores %>%
filter(forecaster == "COVIDhub-baseline")
our_pred_dates <- unique(our_pred_dates$forecast_date)
forecast_dates <- our_pred_dates
# If subsetting dates:
# n_dates <- length(our_pred_dates)
# forecast_dates <- our_pred_dates[n_dates - 2 * (5:2)]
# Karlen-pypm and CU-select submit a day off from the rest, so shift their
# forecast dates by one day to align all forecasters
scores$forecast_date <-
  if_else(scores$forecaster %in% c("Karlen-pypm", "CU-select"), scores$forecast_date + 1, scores$forecast_date)
scores <- subset(scores, forecast_date %in% forecast_dates & ahead <= aheads)
results <- intersect_averagers(scores, c("forecaster"), c("forecast_date", "geo_value")) %>%
select(c("ahead", "geo_value", "forecaster","forecast_date", "data_source", "signal","target_end_date","incidence_period","actual","wis","ae","cov_80"))
The target forecast dates are:
2020-04-06, 2020-04-13, 2020-04-20, 2020-04-27, 2020-05-04, 2020-05-11, 2020-05-18, 2020-05-25, 2020-06-01, 2020-06-08, 2020-06-15, 2020-06-22, 2020-06-29, 2020-07-06, 2020-07-13, 2020-07-20, 2020-07-27, 2020-08-03, 2020-08-10, 2020-08-17, 2020-08-24, 2020-08-31, 2020-09-07, 2020-09-14, 2020-09-21, 2020-09-28, 2020-10-05, 2020-10-12, 2020-10-19, 2020-10-26, 2020-11-02, 2020-11-09, 2020-11-16, 2020-11-23, 2020-11-30, 2020-12-07, 2020-12-14, 2020-12-21, 2020-12-28, 2021-01-04, 2021-01-11, 2021-01-18, 2021-01-25, 2021-02-01, 2021-02-08, 2021-02-15, 2021-02-22, 2021-03-01, 2021-03-08, 2021-03-15, 2021-03-22, 2021-03-29, 2021-04-05, 2021-04-12, 2021-04-19, 2021-04-26, 2021-05-03, 2021-05-10, 2021-05-17, 2021-05-24, 2021-05-31, 2021-06-07, 2021-06-14, 2021-06-21, 2021-06-28, 2021-07-05, 2021-07-12, 2021-07-19, 2021-07-26, 2021-08-02
The template will compile data of the following forecasters:
CovidAnalytics-DELPHI, COVIDhub-baseline, COVIDhub-ensemble, COVIDhub-trained_ensemble, CU-select, Karlen-pypm.
So that the report is easy to replicate, the data used in it can be downloaded as a CSV file; with the raw data in hand, the user can generate customized plots or even include their own forecaster.
results %>%
download_this(
output_name = "results",
output_extension = ".csv",
button_label = "Download Cases Evaluation",
button_type = "success",
has_icon = TRUE,
csv2 = FALSE,
icon = "fa fa-save"
)
The primary metrics used in evaluating the performance of COVID forecasters are:
By Forecast Dates
The Weighted Interval Score (WIS) can be interpreted as a generalization of the absolute error to probabilistic forecasts, and it decomposes into a measure of sharpness (spread) plus penalties for over- and underprediction. With certain weight settings, the WIS approximates the continuous ranked probability score and can be computed as an average pinball loss. A smaller WIS indicates better performance.
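The pinball-loss form can be made concrete. The helper below is a hypothetical illustration, not part of the report's pipeline, and uses a toy quantile set rather than the levels ForecastHub collects: for a symmetric set of 2K + 1 quantile levels including the median, the WIS equals the sum of pinball losses divided by K + 1/2.

```r
# Hypothetical illustration of WIS as an average pinball loss.
# Not part of the report's pipeline; quantile levels are toy values.
pinball <- function(y, q, tau) {
  # pinball (quantile) loss of quantile forecast q at level tau
  ifelse(y >= q, (y - q) * tau, (q - y) * (1 - tau))
}

wis <- function(y, quantiles, taus) {
  # taus: symmetric around 0.5 and including the median,
  # so they define K central intervals plus the median
  K <- (length(taus) - 1) / 2
  sum(pinball(y, quantiles, taus)) / (K + 0.5)
}

# Toy forecast: median 100 with 50% and 80% central intervals
taus <- c(0.1, 0.25, 0.5, 0.75, 0.9)
qs   <- c(80, 90, 100, 110, 120)
wis(120, qs, taus)  # -> 11.6; the actual value sits at the upper endpoint
```

Note how the penalty grows as the actual value moves past the outer quantiles, while a sharp, well-centered forecast keeps the spread terms small.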
weeks.label <- c("1 week ahead", "2 weeks ahead", "3 weeks ahead", "4 weeks ahead")
names(weeks.label) <- c(1, 2, 3, 4)
subtitle = sprintf("Forecasts made over %s to %s",
format(min(forecast_dates), "%B %d, %Y"),
format(max(forecast_dates), "%B %d, %Y"))
plot_wis <-
plot_canonical(results,
x = "forecast_date",
y = "wis",
aggr = mean,
grp_vars = c("forecaster","ahead"),
facet_rows = "ahead") +
labs(title = subtitle,
x = "Forecast Dates",
y = "Mean WIS",
color = "Forecasters") +
geom_point(aes(text=sprintf("Forecast Date: %s<br>Mean WIS: %s <br>Forecaster: %s",
forecast_date,
round(wis, digits = 2),
color)),
alpha = 0.05) +
facet_wrap(~ahead, nrow = 4, labeller = labeller(ahead=weeks.label)) +
theme_minimal() +
theme(strip.background = element_rect(fill = "white"),
      plot.title = element_text(hjust = 0.5)) +
scale_y_log10()
if (params$colorblind_palette) {
plot_wis <- plot_wis +
scale_color_viridis_d()
}
ggplotly(plot_wis, tooltip = "text", height = 800, width = 1000) %>%
layout(hoverlabel = list(bgcolor = "white"))
By Weeks Ahead
The absolute error (AE) of a forecast is the absolute value of the difference between the actual value and the point forecast. When a model does not provide a point forecast explicitly, the point forecast is taken to be the 50% quantile (median) of the forecast distribution.
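As a minimal sketch with made-up numbers (not from the score cards), extracting the point forecast from a quantile forecast and scoring it:

```r
# Toy sketch: AE of a point forecast taken as the predictive median.
# All numbers here are made up for illustration.
quantile_levels <- c(0.1, 0.25, 0.5, 0.75, 0.9)
predicted       <- c(80, 90, 100, 110, 120)
point  <- predicted[quantile_levels == 0.5]  # the 50% quantile (median)
actual <- 93
ae <- abs(actual - point)  # -> 7
```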
plot_ae <-
plot_canonical(results,
x = "ahead",
y = "ae",
aggr = mean) +
labs(title = subtitle, x = "Weeks Ahead", y = "Mean AE", color = "Forecasters") +
# geom_line(aes(linetype=forecaster, color=forecaster)) +
geom_point(aes(color=forecaster, text=sprintf("Weeks Ahead: %s<br>Average Error: %s <br>Forecaster: %s",
ahead,
round(ae, digits=2),
color)),
alpha = 0.05) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
scale_y_log10()
if (params$colorblind_palette) {
plot_ae <- plot_ae +
scale_color_viridis_d()
}
plot_ae <- ggplotly(plot_ae, tooltip = "text", width = 1000)
# adjust y-axis label
plot_ae[['x']][['layout']][['annotations']][[2]][['x']] <- -0.05
plot_ae %>% layout(margin = list(l = 75), hoverlabel = list(bgcolor = "white"))
By Forecast Dates
Coverage is estimated on a particular date by computing the proportion of locations for which a forecaster’s interval includes the actual value on that date. A perfectly calibrated forecaster would have each interval’s empirical coverage matching its nominal coverage. In the plot, this corresponds to being on the horizontal black line. Overconfidence corresponds to being below the line while underconfidence corresponds to being above the line.
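A minimal sketch of the empirical coverage computation, with made-up data (column names here are illustrative, not the score-card schema): the 80% coverage on one date is the share of locations whose central 80% interval, spanning the 10% and 90% quantiles, contains the actual value.

```r
# Toy sketch: empirical 80% coverage on a single forecast date.
# Data and column names are made up for illustration.
one_date <- data.frame(
  geo_value = c("ca", "ny", "tx", "fl"),
  lower_10  = c(100, 50, 80, 40),
  upper_90  = c(200, 90, 160, 70),
  actual    = c(150, 95, 120, 60)
)
cov_80 <- with(one_date, mean(actual >= lower_10 & actual <= upper_90))
cov_80  # -> 0.75: three of the four intervals cover the actual value
```

A value near 0.80 on the plot indicates good calibration; well below 0.80 indicates overconfident (too narrow) intervals.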
plot_cov80 <-
plot_canonical(results,
x = "forecast_date",
y = "cov_80",
aggr = mean,
grp_vars = c("forecaster","ahead"),
facet_rows = "ahead") +
labs(title = subtitle, x = "Forecast Date", y = "Mean 80% Coverage", color = "Forecasters") +
geom_point(aes(text = sprintf("Forecast Date: %s<br>Coverage: %s <br>Forecaster: %s",
forecast_date,
round(cov_80, digits = 2),
color)),
alpha = 0.05) +
facet_wrap(~ahead, nrow = 4, labeller = labeller(ahead = weeks.label)) +
theme_minimal() +
theme(strip.background = element_rect(fill = "white"),
      plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 0.8)
if (params$colorblind_palette) {
plot_cov80 <- plot_cov80 +
scale_color_viridis_d()
}
plot_cov80 <- ggplotly(plot_cov80, tooltip = "text", height = 800, width = 1000)
# adjust y-axis label
plot_cov80[['x']][['layout']][['annotations']][[2]][['x']] <- -0.05
plot_cov80 %>% layout(margin = list(l = 75), hoverlabel = list(bgcolor = "white"))
Performance is measured relative to COVIDhub-baseline (the primary forecaster): each forecaster's WIS is first scaled by the baseline's WIS, and the geometric mean is then taken over the resulting ratios, ignoring the few zeros.
geom_mean <- function(x) exp(mean(log(x))) # equals prod(x)^(1/length(x)) but avoids overflow/underflow for long vectors
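The scale-before-aggregating order matters: the geometric mean of ratios is not the ratio of means. A toy sketch with made-up numbers:

```r
# Toy sketch of scale-then-aggregate: divide each WIS by the baseline's WIS
# on the same (forecast_date, geo_value) pair, then take the geometric mean
# of the ratios. Zeros must be dropped first, since log(0) is -Inf.
# All numbers are made up for illustration.
wis_model    <- c(12, 8, 20, 5)
wis_baseline <- c(10, 10, 16, 4)
ratios <- wis_model / wis_baseline   # scale before aggregating
exp(mean(log(ratios)))               # geometric mean; < 1 means the model beat the baseline
```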
mean_wis <-
plot_canonical(results %>%
filter(wis > 0),
x = "ahead",
y = "wis",
aggr = geom_mean,
base_forecaster = "COVIDhub-baseline",
scale_before_aggr = TRUE) +
labs(title = subtitle,
x = "Weeks Ahead",
y = "Geometric mean relative WIS",
color = "Forecasters") +
geom_point(aes(text = sprintf("Weeks Ahead: %s<br>WIS: %s <br>Forecaster: %s",
ahead,
round(wis, digits = 2),
color)),
alpha = 0.05) +
theme_minimal() +
theme(plot.title = element_text(hjust = 0.5)) +
geom_hline(yintercept = 1)
if (params$colorblind_palette) {
mean_wis <- mean_wis +
scale_color_viridis_d()
}
mean_wis <- ggplotly(mean_wis, tooltip = "text", width = 1000)
# adjust y-axis label
mean_wis[['x']][['layout']][['annotations']][[2]][['x']] <- -0.05
mean_wis %>% layout(margin = list(l = 75), hoverlabel = list(bgcolor = "white"))
mean_wis_forecast_date <-
plot_canonical(results %>%
filter(wis > 0),
x = "forecast_date",
y = "wis",
aggr = geom_mean,
facet_rows = "ahead",
grp_vars = c("forecaster", "ahead"),
base_forecaster = "COVIDhub-baseline",
scale_before_aggr = TRUE) +
labs(title = subtitle,
x = "Forecast date",
y = "Geometric mean relative WIS",
color = "Forecasters") +
geom_point(aes(text = sprintf("Forecast Date: %s<br>WIS: %s <br>Forecaster: %s",
forecast_date,
round(wis, digits = 2),
color)),
alpha = 0.05) +
facet_wrap(~ahead, nrow = 4, labeller = labeller(ahead = weeks.label)) +
theme_minimal() +
theme(strip.background = element_rect(fill = "white"),
      plot.title = element_text(hjust = 0.5),
      legend.position = "bottom") +
geom_hline(yintercept = 1)
if (params$colorblind_palette) {
mean_wis_forecast_date <- mean_wis_forecast_date +
scale_color_viridis_d()
}
mean_wis_forecast_date <- ggplotly(mean_wis_forecast_date, tooltip = "text", height = 800, width = 1000)
# adjust y-axis label
mean_wis_forecast_date[['x']][['layout']][['annotations']][[2]][['x']] <- -0.05
mean_wis_forecast_date %>% layout(margin = list(l = 75), hoverlabel = list(bgcolor = "white"))
To contextualize the forecast evaluations, the following tabs illustrate the performance of COVID forecasts across all US states, by forecast date and by weeks ahead. Note that the scores are scaled by population (reported per 100,000 people).
library(sf)
maps <- results %>%
group_by(geo_value, forecaster) %>%
summarise(across(wis:cov_80, mean)) %>%
left_join(animalia::state_population, by = "geo_value") %>%
mutate(across(wis:cov_80, ~ .x / population * 1e5)) %>%
pivot_longer(wis:cov_80, names_to = "score") %>%
group_by(score) %>%
mutate(time_value = Sys.Date(),
r = max(value)) %>%
group_by(forecaster, .add = TRUE) %>%
group_split()
# for county prediction, set geo_type = "county"
maps <- purrr::map(maps,
~as.covidcast_signal(
.x, signal = .x$score[1],
data_source = .x$forecaster[1],
geo_type = "state"))
maps <- purrr::map(maps,
~plot(.x,
choro_col = scales::viridis_pal()(3),
range = c(0,.x$r[1])))
nfcasts <- length(unique(results$forecaster))
The Weighted Interval Score (WIS) can be interpreted as a generalization of the absolute error to probabilistic forecasts, and it decomposes into a measure of sharpness (spread) plus penalties for over- and underprediction. With certain weight settings, the WIS approximates the continuous ranked probability score and can be computed as an average pinball loss. A smaller WIS indicates better performance.
cowplot::plot_grid(plotlist = maps[((nfcasts*2)+1):length(maps)], ncol = 3)
The absolute error (AE) of a forecast is the absolute value of the difference between the actual value and the point forecast. When a model does not provide a point forecast explicitly, the point forecast is taken to be the 50% quantile (median) of the forecast distribution.
cowplot::plot_grid(plotlist = maps[1:nfcasts], ncol = 3)
Coverage is estimated on a particular date by computing the proportion of locations for which a forecaster’s interval includes the actual value on that date. A perfectly calibrated forecaster would have each interval’s empirical coverage matching its nominal coverage. In the plot, this corresponds to being on the horizontal black line. Overconfidence corresponds to being below the line while underconfidence corresponds to being above the line.
cowplot::plot_grid(plotlist = maps[(nfcasts+1):(nfcasts*2)], ncol = 3)